Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are many compelling reasons to invest in stocks: they help fight inflation, build wealth, and offer certain tax benefits. Steady returns compounded over a long period can grow far more than intuition suggests, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns while facing lower risk, since gains in one segment temper losses in another when the market is down. It is easy to get lost in the sea of financial metrics used to value a stock, and repeating that analysis across a multitude of stocks to identify the right picks for an individual is tedious. Cluster analysis can identify groups of stocks that exhibit similar characteristics, as well as groups that exhibit minimal correlation with one another. This helps investors analyze stocks across different market segments and protects the portfolio from concentrated risks that could leave it vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
pip install yellowbrick --user
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
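The hierarchy imports above (`linkage`, `cophenet`, `dendrogram`) come into play later in the analysis. As a minimal sketch of how `linkage` and `cophenet` fit together, on synthetic data rather than the stock dataset: the cophenetic correlation coefficient measures how faithfully a dendrogram preserves the original pairwise distances.

```python
# Minimal sketch on synthetic data (not the stock dataset): build a linkage
# matrix for hierarchical clustering and compute the cophenetic correlation.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# two well-separated blobs of 20 points each in 3 dimensions
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(8, 1, (20, 3))])

Z = linkage(X, method="average")   # linkage matrix used to draw a dendrogram
c, _ = cophenet(Z, pdist(X))       # cophenetic correlation coefficient
print(round(c, 3))                 # close to 1 for well-separated clusters
```

A value of `c` near 1 indicates the chosen linkage method represents the data's distance structure well; this is the same diagnostic typically used to pick a linkage method.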
# importing the data
data_st = pd.read_csv('stock_data.csv')
# creating a copy of the original data frame
df = data_st.copy()
df.shape
(340, 15)
# checking for first 5 rows
df.head(5)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
# checking for last 5 rows
df.tail(5)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 939457327.6 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.4 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 188461538.5 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 257892500.0 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 498529411.8 | 70.470585 | 1.723068 |
# checking for random rows
df.sample(10)
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 125 | FB | Facebook | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 |
| 23 | AMGN | Amgen Inc | Health Care | Biotechnology | 162.330002 | 17.163478 | 1.630259 | 25 | 362 | 413000000 | 6939000000 | 9.15 | 7.583607e+08 | 17.740984 | 24.012322 |
| 95 | DLPH | Delphi Automotive | Consumer Discretionary | Auto Parts & Equipment | 85.730003 | 12.109326 | 1.440884 | 64 | 14 | -325000000 | 1450000000 | 5.08 | 2.854331e+08 | 16.875985 | -0.662152 |
| 259 | PWR | Quanta Services Inc. | Industrials | Industrial Conglomerates | 20.250000 | -16.632362 | 2.954291 | 10 | 11 | -61744000 | 321824000 | 1.59 | 2.024050e+08 | 12.735849 | 4.291894 |
| 49 | BSX | Boston Scientific | Health Care | Health Care Equipment | 18.440001 | 11.757582 | 1.491764 | 4 | 13 | -268000000 | -239000000 | -0.18 | 1.327778e+09 | 31.468962 | -3.880921 |
| 209 | MMC | Marsh & McLennan | Financials | Insurance Brokers | 55.450001 | 6.022948 | 1.034162 | 25 | 99 | -584000000 | 1599000000 | 3.01 | 5.312292e+08 | 18.421927 | -1.950194 |
| 254 | PNW | Pinnacle West Capital | Utilities | MultiUtilities | 64.480003 | 0.498752 | 1.143421 | 10 | 3 | 31884000 | 437257000 | 3.94 | 1.109789e+08 | 16.365483 | -6.089228 |
| 47 | BLL | Ball Corp | Materials | Metal & Glass Containers | 72.730003 | 16.535816 | 1.386684 | 22 | 10 | 32600000 | 280900000 | 2.05 | 1.370244e+08 | 35.478050 | -3.895657 |
| 120 | EXPD | Expeditors Int'l | Industrials | Air Freight & Logistics | 45.099998 | -4.449159 | 1.062553 | 27 | 94 | -119311000 | 457223000 | 2.42 | 1.889351e+08 | 18.636363 | 5.991459 |
| 189 | LMT | Lockheed Martin Corp. | Industrials | Aerospace & Defense | 217.149994 | 5.254227 | 0.903098 | 116 | 8 | -356000000 | 3605000000 | 11.62 | 3.102410e+08 | 18.687607 | -10.852854 |
# checking for missing values
df.isnull().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
# checking for duplication in the data
df.duplicated().sum()
0
# checking for data types
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
# checking for unique values in each column
df.nunique(axis=0)
Ticker Symbol 340 Security 340 GICS Sector 11 GICS Sub Industry 104 Current Price 336 Price Change 340 Volatility 340 ROE 82 Cash Ratio 124 Net Cash Flow 332 Net Income 335 Earnings Per Share 268 Estimated Shares Outstanding 323 P/E Ratio 295 P/B Ratio 299 dtype: int64
# Let us now look at the count of unique values in the categorical columns GICS Sector and GICS Sub Industry
cat_list = ['GICS Sector', 'GICS Sub Industry']
# Printing number of count of each unique value in each column
for column in cat_list:
print(df[column].value_counts())
print("-" * 50)
Industrials 53 Financials 49 Health Care 40 Consumer Discretionary 40 Information Technology 33 Energy 30 Real Estate 27 Utilities 24 Materials 20 Consumer Staples 19 Telecommunications Services 5 Name: GICS Sector, dtype: int64 -------------------------------------------------- Oil & Gas Exploration & Production 16 REITs 14 Industrial Conglomerates 14 Electric Utilities 12 Internet Software & Services 12 Health Care Equipment 11 MultiUtilities 11 Banks 10 Property & Casualty Insurance 8 Biotechnology 7 Diversified Financial Services 7 Oil & Gas Refining & Marketing & Transportation 6 Semiconductors 6 Packaged Foods & Meats 6 Pharmaceuticals 6 Consumer Finance 5 Diversified Chemicals 5 Health Care Facilities 5 Integrated Oil & Gas 5 Industrial Machinery 5 Managed Health Care 5 Airlines 5 Asset Management & Custody Banks 4 Retail REITs 4 Integrated Telecommunications Services 4 Railroads 4 Hotels, Resorts & Cruise Lines 4 Residential REITs 4 Internet & Direct Marketing Retail 4 Aerospace & Defense 4 Specialty Chemicals 4 Building Products 4 Soft Drinks 4 Research & Consulting Services 4 Specialized REITs 3 Specialty Stores 3 IT Consulting & Other Services 3 Cable & Satellite 3 Life & Health Insurance 3 Oil & Gas Equipment & Services 3 Restaurants 3 Insurance Brokers 3 Air Freight & Logistics 3 Regional Banks 3 Household Products 3 Health Care Distributors 3 Construction & Farm Machinery & Heavy Trucks 3 Auto Parts & Equipment 2 Data Processing & Outsourced Services 2 Broadcasting & Cable TV 2 Homebuilding 2 Automobile Manufacturers 2 Advertising 2 Fertilizers & Agricultural Chemicals 2 Leisure Products 2 Paper Packaging 2 Electronic Components 2 Health Care Supplies 2 Tobacco 2 Construction Materials 2 Investment Banking & Brokerage 2 Application Software 2 Industrial Materials 1 Networking Equipment 1 Motorcycle Manufacturers 1 Specialty Retail 1 Drug Retail 1 Water Utilities 1 Brewers 1 Apparel, Accessories & Luxury Goods 1 Personal Products 1 Office REITs 1 
Industrial Gases 1 Human Resource & Employment Services 1 Agricultural Products 1 Copper 1 Real Estate Services 1 Technology, Hardware, Software and Supplies 1 Consumer Electronics 1 Home Entertainment Software 1 Semiconductor Equipment 1 Tires & Rubber 1 Home Furnishings 1 Life Sciences Tools & Services 1 Steel 1 Electrical Components & Equipment 1 Environmental Services 1 Household Appliances 1 Trucking 1 Financial Exchanges & Data 1 Publishing 1 Technology Hardware, Storage & Peripherals 1 Housewares & Specialties 1 Multi-line Insurance 1 Diversified Commercial Services 1 Electronic Equipment & Instruments 1 Computer Hardware 1 Metal & Glass Containers 1 Multi-Sector Holdings 1 Alternative Carriers 1 Distributors 1 Casinos & Gaming 1 Gold 1 Thrifts & Mortgage Finance 1 Name: GICS Sub Industry, dtype: int64 --------------------------------------------------
# Statistical description of the data
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | SPG | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Security | 340 | 340 | LyondellBasell | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net Cash Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327937 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings Per Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.0 | NaN | NaN | NaN | 577028337.754029 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
As seen in the statistical summary above, LyondellBasell appears as the top `Security`.
# Let us look at the attributes of LyondellBasell, the top company as per the data description
top = df[df['Security'] == 'LyondellBasell']
top
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 194 | LYB | LyondellBasell | Materials | Diversified Chemicals | 86.900002 | 2.573183 | 1.609745 | 68 | 46 | -107000000 | 4476000000 | 9.62 | 465280665.3 | 9.033264 | 10.316354 |
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
df: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=df, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=df, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
) # For histogram; "auto" lets the bin count be chosen automatically
ax_hist2.axvline(
df[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
df[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(df, feature, perc=False, n=None):
"""
Barplot with percentage at the top
df: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df[feature]) # length of the column
count = df[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=df,
x=feature,
palette="Paired",
order=df[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # x-coordinate of the bar's center
y = p.get_height() # height of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Plotting the histogram-boxplot on all the numerical columns at once
# Creating a data frame of the numerical columns
num_col = df.select_dtypes(include = np.number)
# loop to plot all columns at once
for col in num_col:
histogram_boxplot(df, col)
Current Price: The distribution is heavily right-skewed. Most prices are below $200; very few are above $200, and one above $1200 is an outlier.
Price Change: The distribution is close to normal.
Volatility: The distribution is right-skewed; a few stocks show very high volatility and are outliers.
ROE: The distribution is heavily right-skewed. Most values are under 80, with possible outliers beyond that.
Cash Ratio: The distribution is right-skewed; most companies have a cash ratio below 200, with a few extreme values above it.
Net Cash Flow: The distribution is roughly normal with possible outliers.
Net Income: The distribution is roughly normal with possible outliers.
Earnings Per Share: The distribution is roughly normal with possible outliers.
Estimated Shares Outstanding: The distribution is heavily right-skewed; a few companies have billions of shares outstanding.
P/E Ratio: The distribution is heavily right-skewed; a few companies have a P/E ratio greater than 100.
P/B Ratio: The distribution is roughly normal with possible outliers.
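The skewness observations above can be backed numerically with pandas' `.skew()`: positive values indicate a right tail, values near zero an approximately symmetric distribution. A minimal sketch on a synthetic frame (in practice the notebook's `df` would be used; `toy` and its column names are illustrative):

```python
# Sketch: quantify skewness with pandas. Synthetic data stands in for the
# stock dataframe so the example runs standalone.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(0, 1, 1000),    # heavy right tail
    "roughly_normal": rng.normal(0, 1, 1000),     # symmetric
})

skew = toy.skew()   # positive => right-skewed, ~0 => symmetric
print(skew.round(2))
```

Applying `df.select_dtypes(include=np.number).skew()` to the stock data would rank the columns by skewness and confirm which distributions are the most heavily right-skewed.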
labeled_barplot(df, 'GICS Sector', perc=True)
plt.show()
labeled_barplot(df, 'GICS Sub Industry', perc=True)
plt.show()
# Histograms to understand the distribution of the data
df.hist(figsize=(20,15),bins=10,color="red",xlabelsize=20, ylabelsize=10,xrot=45 )
plt.show()
sns.pairplot(df,diag_kind="kde")
plt.show()
sns.heatmap(
df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
) # numeric_only=True restricts the correlation matrix to numerical columns
plt.show()
histogram_boxplot(df,'Current Price')
plt.figure(figsize=(10,6))
sns.barplot(data=df, x='GICS Sector', y='Price Change', ci=False)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(10,6))
sns.barplot(data=df, x='GICS Sector', y='Cash Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(10,6))
sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
plt.xticks(rotation=90)
plt.show()
# creating a copy of the data
df1 = df.copy()
plt.figure(figsize=(15, 12))
num_col_out = df1.select_dtypes(include=np.number)
for i,column in enumerate(num_col_out):
plt.subplot(3, 4, i + 1)
plt.boxplot(df[column], whis=1.5)
plt.tight_layout()
plt.title(column)
trans = StandardScaler()
subset = num_col_out.copy()
subset_scaled = trans.fit_transform(subset)
# creating a dataframe of the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df.head()
| | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
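As a quick sanity check of the z-score scaling: every scaled column should have mean ≈ 0 and population standard deviation ≈ 1, since `StandardScaler` subtracts the column mean and divides by the population standard deviation. A sketch on synthetic data so it runs standalone (`raw` and its columns are illustrative):

```python
# Sketch: verify that StandardScaler produces zero-mean, unit-variance columns.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
raw = pd.DataFrame({
    "price": rng.uniform(5, 500, 200),     # wide-ranging float feature
    "roe": rng.integers(1, 100, 200),      # integer feature on another scale
})

scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# means should be ~0; ddof=0 gives the population std, which should be ~1
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```

Note that `Series.std()` defaults to the sample standard deviation (`ddof=1`), which will be slightly above 1 on scaled data; `ddof=0` matches what `StandardScaler` divides by.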
# Now that the data has been scaled, let us check the distributions of the scaled features using histogram plots
# Histograms to understand the distribution of the data
subset_scaled_df.hist(figsize=(20,15),bins=10,color="red",xlabelsize=20, ylabelsize=10,xrot=45 )
plt.show()
# loop to plot all plots at once
for col in subset_scaled_df:
histogram_boxplot(subset_scaled_df, col)
plt.show()
sns.heatmap(
subset_scaled_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
sns.pairplot(subset_scaled_df,diag_kind="kde")
plt.show()
# Creating a new data frame with the scaled data to perform K-Means clustering.
k_means_df = subset_scaled_df.copy()
clusters = range(1,15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(subset_scaled_df)
prediction = model.predict(k_means_df)
distortion = (
sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
/ k_means_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters: 3 Average Distortion: 2.2692367155390745 Number of Clusters: 4 Average Distortion: 2.1745559827866363 Number of Clusters: 5 Average Distortion: 2.128799332840716 Number of Clusters: 6 Average Distortion: 2.080400099226289 Number of Clusters: 7 Average Distortion: 2.0289794220177395 Number of Clusters: 8 Average Distortion: 1.964144163389972 Number of Clusters: 9 Average Distortion: 1.9221492045198068 Number of Clusters: 10 Average Distortion: 1.8513913649973124 Number of Clusters: 11 Average Distortion: 1.8024134734578485 Number of Clusters: 12 Average Distortion: 1.7900931879652673 Number of Clusters: 13 Average Distortion: 1.7417609203336912 Number of Clusters: 14 Average Distortion: 1.673559857259703
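The "average distortion" computed in the loop above is the mean distance from each point to its nearest centroid, while scikit-learn's `KMeans.inertia_` is the *sum of squared* distances, so the two curves track each other but are not identical. A small check on synthetic data (all names local to the sketch):

```python
# Sketch: relate the elbow loop's average distortion to KMeans.inertia_.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))

model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# distance from each point to its nearest centroid
d = np.min(cdist(X, model.cluster_centers_, "euclidean"), axis=1)

avg_distortion = d.mean()        # what the elbow loop above computes
sum_sq = (d ** 2).sum()          # should match model.inertia_
print(round(avg_distortion, 4), np.isclose(sum_sq, model.inertia_))
```

Either quantity can drive the elbow method; the distortion/inertia always decreases as k grows, so the "elbow" where improvement slows is what matters, not the absolute value.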
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
plt.show()
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((subset_scaled_df))
score = silhouette_score(k_means_df, preds)
sil_score.append(score)
print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457 For n_clusters = 3, the silhouette score is 0.4644405674779404 For n_clusters = 4, the silhouette score is 0.4577225970476733 For n_clusters = 5, the silhouette score is 0.43228336443659804 For n_clusters = 6, the silhouette score is 0.4005422737213617 For n_clusters = 7, the silhouette score is 0.3976335364987305 For n_clusters = 8, the silhouette score is 0.40278401969450467 For n_clusters = 9, the silhouette score is 0.3778585981433699 For n_clusters = 10, the silhouette score is 0.13458938329968687 For n_clusters = 11, the silhouette score is 0.1421832155528444 For n_clusters = 12, the silhouette score is 0.2044669621527429 For n_clusters = 13, the silhouette score is 0.23424874810104204 For n_clusters = 14, the silhouette score is 0.12102526472829901
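The intuition behind reading the silhouette curve: on data with a known number of well-separated clusters, the silhouette score peaks near the true k, which is why the maximum above is a reasonable guide. A minimal sketch on synthetic blobs (`make_blobs` parameters are illustrative):

```python
# Sketch: on synthetic data with 3 well-separated blobs, the silhouette
# score should be high at k=3 and degrade for poorly matched k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

scores = {
    k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    )
    for k in range(2, 7)
}
print(max(scores, key=scores.get))   # the k with the best silhouette score
```

Silhouette values lie in [-1, 1]; values near 1 mean points sit well inside their own cluster, near 0 mean clusters overlap, and negative values suggest misassignment.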
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=2, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=7, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=8, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=9, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
plt.show()
# final K-means model
kmeans = KMeans(n_clusters=4, random_state=1)
kmeans.fit(k_means_df)
KMeans(n_clusters=4, random_state=1)
# creating a copy of the original data
df2 = df1.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["KM_segments"] = kmeans.labels_
df2["KM_segments"] = kmeans.labels_
km_cluster_profile = df2.groupby("KM_segments").mean(numeric_only=True)
df2.head()
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | KM_segments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 0 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 0 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 3 |
km_cluster_profile["count_in_each_segment"] = (
df2.groupby("KM_segments")["Security"].count().values
)
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| KM_segments | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -14046223.826715 | 1482212389.891697 | 3.621029 | 438533835.667184 | 23.843656 | -3.358948 | 277 |
| 1 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1072272727.272727 | 14833090909.090910 | 4.154545 | 4298826628.727273 | 14.803577 | -4.552119 | 11 |
| 2 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -159428481.481481 | -3887457740.740741 | -9.473704 | 480398572.845926 | 90.619220 | 1.342067 | 27 |
| 3 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1554926560.000000 | 1572611680.000000 | 6.045200 | 578316318.948800 | 74.960824 | 14.402452 | 25 |
# Let us print the companies in each cluster
for cl in df2["KM_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df2[df2["KM_segments"] == cl]["Security"].unique())
print()
In cluster 0, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 
'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 
'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 3, the following companies are present: ['Analog Devices, Inc.' 'Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc' 'Amazon.com Inc' 'Bank of America Corp' 'BIOGEN IDEC Inc.' 'Celgene Corp.' 
'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Halliburton Co.' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] In cluster 2, the following companies are present: ['Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] In cluster 1, the following companies are present: ['Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
# let's now look at the sector counts within each segment/cluster
df2.groupby(['KM_segments','GICS Sector'])['Security'].count()
KM_segments GICS Sector
0 Consumer Discretionary 33
Consumer Staples 17
Energy 6
Financials 45
Health Care 29
Industrials 52
Information Technology 24
Materials 19
Real Estate 26
Telecommunications Services 2
Utilities 24
1 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 2
Information Technology 1
Telecommunications Services 2
2 Energy 22
Industrials 1
Information Technology 3
Materials 1
3 Consumer Discretionary 6
Consumer Staples 1
Energy 1
Financials 1
Health Care 9
Information Technology 5
Real Estate 1
Telecommunications Services 1
Name: Security, dtype: int64
fig, axes = plt.subplots(3, 4, figsize=(20, 20))
counter = 0
for ii in range(3):
for jj in range(4):
if counter < 11:
sns.boxplot(
ax=axes[ii][jj],
data=df2,
y=df2.columns[4+counter],
x="KM_segments",
)
counter = counter + 1
fig.tight_layout(pad=3.0)
# creating a copy of the data
hc_df = subset_scaled_df.copy()
# list of distance metrics
distance_metrics = ['euclidean', "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single" , "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524591.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159737.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
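For reference, a minimal self-contained sketch (toy data, not the project's dataset) of what the loop above measures: the cophenetic correlation is the correlation between the pairwise distances in the original data and the cophenetic distances implied by the dendrogram, so values close to 1 indicate the hierarchy preserves the original distances well.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# toy data: two tight, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])

# build the hierarchy, then compare cophenetic vs. original pairwise distances
Z = linkage(X, metric="euclidean", method="average")
c, coph_dists = cophenet(Z, pdist(X))
print(round(c, 3))  # close to 1 for well-separated clusters
```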
### checking all the linkage methods with Euclidean distance as the metric
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hc_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.
****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(hc_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
compare.append([method, coph_corr])
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
| Linkage | Cophenetic Coefficient | |
|---|---|---|
| 4 | ward | 0.710118 |
| 1 | complete | 0.787328 |
| 5 | weighted | 0.869378 |
| 0 | single | 0.923227 |
| 3 | centroid | 0.931401 |
| 2 | average | 0.942254 |
# considering a pair plot to get an idea of the number of clusters to use
sns.pairplot(subset_scaled_df,diag_kind="kde")
plt.show()
# using Euclidean distance as the metric, average linkage, and n_clusters = 4 to build our model
# building a model with the best parameters selected
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="average")
# fitting the model on the data
HCmodel.fit(hc_df)
AgglomerativeClustering(linkage='average', n_clusters=4)
# creating a copy of the data
df3 = df.copy()
# adding the cluster labels as a new column to the scaled and original data frames
hc_df["HC_segments"] = HCmodel.labels_
df3["HC_segments"] = HCmodel.labels_
# computing the mean of each attribute for each cluster
hc_cluster_profile = df3.groupby("HC_segments").mean()
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | |||||||||||
| 0 | 77.573266 | 4.148438 | 1.515708 | 35.184524 | 67.154762 | 6.710469e+07 | 1.607391e+09 | 2.90564 | 5.723178e+08 | 32.325679 | -1.762402 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.09000 | 5.093552e+07 | 25.453183 | -1.052429 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1.292500e+09 | -1.910650e+10 | -41.81500 | 5.195740e+08 | 60.748608 | 1.565141 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 5.920000e+08 | 3.669000e+09 | 1.31000 | 2.800763e+09 | 79.893133 | 5.884467 |
# adding a new column to show the count of each group
hc_cluster_profile["count_in_each_segment"] = (
df3.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 77.573266 | 4.148438 | 1.515708 | 35.184524 | 67.154762 | 6.710469e+07 | 1.607391e+09 | 2.90564 | 5.723178e+08 | 32.325679 | -1.762402 | 336 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.09000 | 5.093552e+07 | 25.453183 | -1.052429 | 1 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1.292500e+09 | -1.910650e+10 | -41.81500 | 5.195740e+08 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 5.920000e+08 | 3.669000e+09 | 1.31000 | 2.800763e+09 | 79.893133 | 5.884467 | 1 |
# building a new model with complete linkage
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="complete")
# fitting the model on the data
HCmodel.fit(hc_df)
AgglomerativeClustering(linkage='complete', n_clusters=4)
# adding the cluster labels as a new column to the scaled and original data frames
hc_df["HC_segments"] = HCmodel.labels_
df3["HC_segments"] = HCmodel.labels_
# computing the mean of each attribute for each cluster
hc_cluster_profile = df3.groupby("HC_segments").mean()
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | |||||||||||
| 0 | 77.505788 | 4.194932 | 1.513252 | 34.744745 | 69.567568 | -3.391204e+07 | 1.592498e+09 | 2.995961 | 5.676948e+08 | 32.402078 | -2.162612 |
| 1 | 109.283336 | 9.555034 | 1.253895 | 18.333333 | 95.333333 | 1.120063e+10 | 9.301514e+09 | 5.166667 | 1.883537e+09 | 19.446599 | 44.027184 |
| 2 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.090000 | 5.093552e+07 | 25.453183 | -1.052429 |
| 3 | 26.990000 | -14.060688 | 3.296307 | 603.000000 | 57.333333 | -5.850000e+08 | -1.755567e+10 | -39.726667 | 4.819101e+08 | 71.528835 | 1.638633 |
# adding a new column to show the count of each group
hc_cluster_profile["count_in_each_segment"] = (
df3.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 77.505788 | 4.194932 | 1.513252 | 34.744745 | 69.567568 | -3.391204e+07 | 1.592498e+09 | 2.995961 | 5.676948e+08 | 32.402078 | -2.162612 | 333 |
| 1 | 109.283336 | 9.555034 | 1.253895 | 18.333333 | 95.333333 | 1.120063e+10 | 9.301514e+09 | 5.166667 | 1.883537e+09 | 19.446599 | 44.027184 | 3 |
| 2 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.090000 | 5.093552e+07 | 25.453183 | -1.052429 | 1 |
| 3 | 26.990000 | -14.060688 | 3.296307 | 603.000000 | 57.333333 | -5.850000e+08 | -1.755567e+10 | -39.726667 | 4.819101e+08 | 71.528835 | 1.638633 | 3 |
# building a new model with single linkage
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="single")
# fitting the model on the data
HCmodel.fit(hc_df)
AgglomerativeClustering(linkage='single', n_clusters=4)
# adding the cluster labels as a new column to the scaled and original data frames
hc_df["HC_segments"] = HCmodel.labels_
df3["HC_segments"] = HCmodel.labels_
# computing the mean of each attribute for each cluster
hc_cluster_profile = df3.groupby("HC_segments").mean()
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | |||||||||||
| 0 | 76.846312 | 4.052844 | 1.525345 | 37.053412 | 69.789318 | 5.865072e+07 | 1.568163e+09 | 2.807908 | 5.806745e+08 | 32.459017 | -2.128154 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.090000 | 5.093552e+07 | 25.453183 | -1.052429 |
| 2 | 44.470001 | 11.397804 | 2.405408 | 917.000000 | 80.000000 | 6.980000e+08 | -2.352800e+10 | -61.200000 | 3.844444e+08 | 93.089287 | 4.970809 |
| 3 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 9.088500e+07 | 5.965410e+08 | 8.910000 | 6.695185e+07 | 31.040405 | 129.064585 |
# adding a new column to show the count of each group
hc_cluster_profile["count_in_each_segment"] = (
df3.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 76.846312 | 4.052844 | 1.525345 | 37.053412 | 69.789318 | 5.865072e+07 | 1.568163e+09 | 2.807908 | 5.806745e+08 | 32.459017 | -2.128154 | 337 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.090000 | 5.093552e+07 | 25.453183 | -1.052429 | 1 |
| 2 | 44.470001 | 11.397804 | 2.405408 | 917.000000 | 80.000000 | 6.980000e+08 | -2.352800e+10 | -61.200000 | 3.844444e+08 | 93.089287 | 4.970809 | 1 |
| 3 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 9.088500e+07 | 5.965410e+08 | 8.910000 | 6.695185e+07 | 31.040405 | 129.064585 | 1 |
# building a new model with ward linkage
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
# fitting the model on the data
HCmodel.fit(hc_df)
AgglomerativeClustering(n_clusters=4)
# adding the cluster labels as a new column to the scaled and original data frames
hc_df["HC_segments"] = HCmodel.labels_
df3["HC_segments"] = HCmodel.labels_
# computing the mean of each attribute for each cluster
hc_cluster_profile = df3.groupby("HC_segments").mean()
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | |||||||||||
| 0 | 74.162501 | -8.152437 | 2.537700 | 178.500000 | 46.437500 | -3.839509e+08 | -3.233045e+09 | -7.804688 | 4.751114e+08 | 105.662011 | -2.458849 |
| 1 | 325.996105 | 7.724708 | 1.545762 | 17.000000 | 360.333333 | 1.865032e+08 | 1.008736e+09 | 9.544167 | 5.894345e+08 | 41.121871 | 25.258283 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 5.586364e+08 | 1.463127e+10 | 3.410000 | 4.242573e+09 | 15.242169 | -4.924615 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 7.995151e+07 | 1.538594e+09 | 3.655351 | 4.464721e+08 | 24.722670 | -2.647194 |
# adding a new column to show the count of each group
hc_cluster_profile["count_in_each_segment"] = (
df3.groupby("HC_segments")["Security"].count().values
)
hc_cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 74.162501 | -8.152437 | 2.537700 | 178.500000 | 46.437500 | -3.839509e+08 | -3.233045e+09 | -7.804688 | 4.751114e+08 | 105.662011 | -2.458849 | 32 |
| 1 | 325.996105 | 7.724708 | 1.545762 | 17.000000 | 360.333333 | 1.865032e+08 | 1.008736e+09 | 9.544167 | 5.894345e+08 | 41.121871 | 25.258283 | 12 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 5.586364e+08 | 1.463127e+10 | 3.410000 | 4.242573e+09 | 15.242169 | -4.924615 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 7.995151e+07 | 1.538594e+09 | 3.655351 | 4.464721e+08 | 24.722670 | -2.647194 | 285 |
# let's see the names of the companies in each cluster
for cl in df3["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df3[df3["HC_segments"] == cl]["Security"].unique())
print()
In cluster 3, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' 
"Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 
'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! 
Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 1, the following companies are present: ['Alliance Data Systems' 'Amgen Inc' 'Chipotle Mexican Grill' 'Equinix' 'Facebook' 'Frontier Communications' 'Intuitive Surgical Inc.' 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'Waters Corporation' 'Yahoo Inc.'] In cluster 0, the following companies are present: ['Allegion' 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kimberly-Clark' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.' 'S&P Global, Inc.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] In cluster 2, the following companies are present: ['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
# let's now look at the sector counts within each cluster
df3.groupby(["HC_segments", "GICS Sector"])['Security'].count()
HC_segments GICS Sector
0 Consumer Discretionary 2
Consumer Staples 2
Energy 22
Financials 1
Health Care 1
Industrials 1
Information Technology 2
Materials 1
1 Consumer Discretionary 2
Consumer Staples 1
Health Care 4
Information Technology 3
Real Estate 1
Telecommunications Services 1
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 4
Health Care 1
Information Technology 1
Telecommunications Services 2
3 Consumer Discretionary 35
Consumer Staples 15
Energy 7
Financials 44
Health Care 34
Industrials 52
Information Technology 27
Materials 19
Real Estate 26
Telecommunications Services 2
Utilities 24
Name: Security, dtype: int64
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 74.162501 | -8.152437 | 2.537700 | 178.500000 | 46.437500 | -383950875.000000 | -3233045437.500000 | -7.804688 | 475111396.795000 | 105.662011 | -2.458849 | 32 |
| 1 | 325.996105 | 7.724708 | 1.545762 | 17.000000 | 360.333333 | 186503166.666667 | 1008736416.666667 | 9.544167 | 589434517.801667 | 41.121871 | 25.258283 | 12 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 285 |
# boxplots of the 11 numeric features by cluster, laid out on a 3x4 grid
fig, axes = plt.subplots(3, 4, figsize=(20, 20))
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter < 11:
            sns.boxplot(
                ax=axes[ii][jj],
                data=df3,
                y=df3.columns[4 + counter],
                x="HC_segments",
            )
        else:
            axes[ii][jj].set_visible(False)  # hide the unused 12th subplot
        counter += 1
fig.tight_layout(pad=3.0)
plt.show()
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced_pca = pca.fit_transform(subset_scaled_df)
# storing results in a dataframe
reduced_data_df_pca = pd.DataFrame(data = X_reduced_pca, columns = ['Component 1', 'Component 2'])
# checking the amount of variance explained
print(f'The first two principal components explain {np.round(100*pca.explained_variance_ratio_.sum(), 2)}% of the variance in the data.')
The first two principal components explain 37.13% of the variance in the data.
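With only ~37% of the variance captured, the 2-D scatter is a rough view of the clusters. Plotting the cumulative explained-variance ratio helps judge how many components a more faithful projection would need (a sketch on synthetic data standing in for `subset_scaled_df`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # stand-in for the scaled feature matrix
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]      # inject correlation so variance concentrates

# fit with all components kept, then accumulate the variance ratios
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components explaining at least 80% of the variance
n_80 = int(np.argmax(cum_var >= 0.80)) + 1
print(n_80, cum_var)
```

The same cumulative curve on the real scaled data would show how far beyond two components one must go before the projection stops discarding most of the structure.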
sns.scatterplot(data=reduced_data_df_pca, x='Component 1', y='Component 2')
plt.show()
sns.scatterplot(data=reduced_data_df_pca, x='Component 1', y='Component 2', hue=df3["HC_segments"], palette='rainbow')
plt.show()
**Group 0:** Has very high volatility. Its high ROE would normally be desirable, suggesting the companies use shareholders' equity efficiently to generate income, and its high P/E ratio indicates that investors expect stronger earnings growth than for companies with lower P/E. However, Net Cash Flow, Net Income, and Earnings Per Share are all negative, so in my view this is not a desirable group to invest in.

**Group 1:** This group has a high current stock price, a large price change, and a higher P/B ratio, which makes these stocks relatively expensive. However, it also shows high Earnings Per Share, which suggests these stocks could yield good returns.

**Group 2:** This group is moderate in current price, volatility, and Earnings Per Share. Its low P/E ratio makes it a reasonable group to invest in, carrying relatively little risk.

**Group 3:** Most attributes in this group are moderate compared to the other groups, making it the least risky cluster and a good group to invest in.
Thank you